@IvanKobzarev IvanKobzarev commented Feb 9, 2026

Adding DSv3 SimpleFSDP + inductor auto_bucketing passes to H100 CI.

We need H100 so that the test goes through the grouped_mm path.

Testing:

rm -rf /tmp/test_output && python -m tests.integration_tests.run_tests /tmp/test_output --test_suite h100 --test_name simplefsdp_deepseekv3_auto_bucketing --ngpu 8

Originally I wanted to add this CI to AutoParallel, but we cannot, because it runs on A100 and goes through the non-grouped_mm MoE path.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 9, 2026
@IvanKobzarev IvanKobzarev requested a review from fmassa February 9, 2026 16:47
ngpu: int = 4
disabled: bool = False
skip_rocm_test: bool = False
config_file: str | None = None

Contributor

Do you have to add this change?

Contributor Author

Yeah, I tried to avoid it, but by default it uses a different toml.

Contributor

When we designed the last version, we explicitly avoided having a config_file field and instead chose to apply overrides on top of base_config.toml, since the base config and the other debug model configs share most of their fields.

Contributor Author

@IvanKobzarev IvanKobzarev Feb 10, 2026

@tianyu-l @wwwjn Yes, we can avoid it. Overrides work fine: I just put the config file as an argument in the list with the other args, and the latest override takes precedence. (Updated the PR.)
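
For readers unfamiliar with the "latest override wins" behaviour mentioned above, here is a minimal sketch of the idea. It is not the project's actual argument parser; the helper `resolve_overrides`, the flag names `--config_file` and `--steps`, and the file name `deepseek_v3_debug.toml` are hypothetical and chosen only for illustration (only `base_config.toml` comes from this thread).

```python
def resolve_overrides(args: list[str]) -> dict[str, str]:
    """Collapse ``--key value`` / ``--key=value`` args; later entries win."""
    resolved: dict[str, str] = {}
    i = 0
    while i < len(args):
        token = args[i]
        if not token.startswith("--"):
            raise ValueError(f"expected a --flag, got {token!r}")
        if "=" in token:
            # Form: --key=value
            key, value = token[2:].split("=", 1)
            i += 1
        else:
            # Form: --key value
            if i + 1 >= len(args):
                raise ValueError(f"missing value for {token!r}")
            key, value = token[2:], args[i + 1]
            i += 2
        # A repeated key simply overwrites the earlier value.
        resolved[key] = value
    return resolved


# Hypothetical defaults shared by every test, followed by per-test overrides.
base_args = ["--config_file", "base_config.toml", "--steps", "10"]
test_args = ["--config_file", "deepseek_v3_debug.toml"]

resolved = resolve_overrides(base_args + test_args)
print(resolved["config_file"])
```

Because the per-test arguments are appended after the shared defaults, a test can point at a different config file without a dedicated `config_file` field on the test definition.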

Contributor

SG. It seems CI failed for a different reason?

Contributor Author

Yes, something with NVLink SHARP that I have also seen without this change.

Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.

5 participants